Week 8 - Support Vector Machines¶

Dr. David Elliott¶

1.6. Support Vector Classifier (SVC)

1.7. Support Vector Machine (SVM)

1.8. Exercises

TODO

  • double check if I should be using capital X... probably should

We can't always perfectly separate the data with a $K − 1$ dimensional hyperplane. To overcome this problem we could either:

  • tweak the constraints on the hyperplane to allow some points to be misclassified (soft margin),
  • transform the data to be separable by a hyperplane in another space (kernel method).

1.6. Support Vector Classifier (SVC) ¶

SVCs are a generalisation and extension of the maximal margin classifier so that it can be applied to a broader range of cases1.

In practice they are more robust to individual observations and better classify most training observations than the maximal margin classifier. This is because they take the approach that it is better to misclassify a few training examples in order to do a better job classifying the rest.

This is called a soft margin because it allows a small subset of the training observations to violate the margin: not only to be on the wrong side of the margin, but even on the wrong side of the hyperplane.

Notes

  • "Developed in the computer science community in the 1990s"2

We want to relax the constraints

$x_i \cdot w + b \geq +1 \quad \text{for} \ y_i = +1$,

$x_i \cdot w + b \leq -1 \quad \text{for} \ y_i = -1$,

when necessary.

This can be done by introducing positive slack variables $\xi_i, i = 1, \ldots , l$ in the constraints5, which then become6:

$x_i \cdot w + b \geq +1 - \xi_i \quad \text{for} \ y_i = +1$,

$x_i \cdot w + b \leq -1 + \xi_i \quad \text{for} \ y_i = -1$,

$\xi_i \geq 0 \quad \forall i$.

Notes

  • $\sum_i\xi_i$ is an upper bound on the number of training errors
  • $\xi_i$ tells us where the $i$th observation is located relative to the hyperplane; $\xi_i = 0$ being on the correct side of the margin, $\xi_i > 0$ being on the wrong side of the margin, and $\xi_i > 1$ on the wrong side of the hyperplane.
  • $\xi_1 = ... = \xi_n = 0$ is the maximal margin hyperplane optimisation.
  • slack variables $\xi_1,..., \xi_n$ allow individual observations to be on the wrong side of the margin or hyperplane.
  • test observations are classified as before, $f(x^*) = \beta_0 + \beta_1x^*_1 + ... + \beta_px^*_p$
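To make the slack variables concrete, here is a minimal sketch (the dataset and parameters are my assumptions, using scikit-learn) that fits a soft-margin linear SVC and recovers each $\xi_i$ as $\max(0, 1 - y_i f(x_i))$:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy two-class data that is not perfectly separable (assumed for illustration).
X, y = make_blobs(n_samples=40, centers=2, cluster_std=2.0, random_state=0)
y_signed = np.where(y == 1, 1, -1)  # recode labels as +/-1

clf = SVC(kernel="linear", C=1.0).fit(X, y_signed)

# Slack for each observation: xi_i = max(0, 1 - y_i * f(x_i)).
f = clf.decision_function(X)
xi = np.maximum(0, 1 - y_signed * f)

print("xi = 0 (correct side of the margin):", int((xi == 0).sum()))
print("0 < xi <= 1 (inside the margin):", int(((xi > 0) & (xi <= 1)).sum()))
print("xi > 1 (wrong side of the hyperplane):", int((xi > 1).sum()))
```

Points with $\xi_i > 1$ are exactly the misclassified ones, which is why $\sum_i\xi_i$ bounds the number of training errors.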

Tuning Parameter (C)¶

To ensure there is a penalty, $C$, for relaxing the constraints, we can change our objective function to be minimised from $\frac{||w||^2}{2}$ to,

$\frac{||w||^2}{2}+C(\sum_i\xi_i)^k$.

If we set $k=1$, neither the $\xi_i$ nor their Lagrange multipliers appear in the Wolfe dual problem. This means we now have6:

$L_D \equiv \sum_i\alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_jy_iy_jx_i\cdot x_j \quad \text{s.t.} \quad 0 \leq \alpha_i \leq C, \quad \sum_i\alpha_iy_i = 0$.

This also has the same solution as before:

$w = \sum^{N_S}_{i=1}\alpha_iy_ix_i$.

  • the $\alpha_i$ now have an upper bound of $C$.

  • In scikit-learn, "the strength of the regularization is inversely proportional to C"
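We can see the box constraint $0 \leq \alpha_i \leq C$ numerically: in scikit-learn, `dual_coef_` stores $y_i\alpha_i$ for the support vectors, so its absolute values never exceed $C$ (toy data assumed):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, cluster_std=2.5, random_state=1)

for C in (0.1, 1.0, 10.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    alphas = np.abs(clf.dual_coef_)  # |y_i * alpha_i| = alpha_i <= C
    print(f"C={C}: max alpha = {alphas.max():.4f}, "
          f"n support vectors = {clf.n_support_.sum()}")
```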

$C$ is a tuning parameter that controls the bias-variance trade-off.

When $C$ is large, margin violations are heavily penalised: we get narrow margins that are rarely violated but highly fit to the training data (low bias-high variance). Conversely, when $C$ is small, the margin is wider and more violations are tolerated, amounting to less hard fitting (high bias-low variance).

Like most hyper-parameters, it is often chosen using cross-validation.
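A minimal sketch of choosing $C$ by cross-validation with scikit-learn's `GridSearchCV` (the grid values and dataset are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Search a logarithmic grid of C values with 5-fold cross-validation.
grid = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)

print("best C:", grid.best_params_["C"])
print("cross-validated accuracy:", round(grid.best_score_, 3))
```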

Like maximal margin classifiers, SVCs rely on only a few observations: those on the margin or those that violate it (the support vectors). Observations on the correct side of the margin don't change the classifier, which means SVCs are robust to observations far away from the hyperplane.

TODO

  • refs for above
  • [Insert a line about the bias-variance trade-off]

It is common to define $C$ as $C = \frac{1}{\nu N}$, where $0 < \nu \leq 1$ controls the fraction of misclassified points during the training phase7.
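scikit-learn exposes this $\nu$-parameterisation directly as `NuSVC`; $\nu$ acts as an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors. A quick sketch (dataset assumed):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import NuSVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=3.0, random_state=0)

for nu in (0.05, 0.2, 0.5):
    clf = NuSVC(nu=nu, kernel="linear").fit(X, y)
    frac_sv = clf.n_support_.sum() / len(X)
    print(f"nu={nu}: fraction of support vectors = {frac_sv:.2f}")
```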

1.7. Support Vector Machine (SVM) ¶

The SVM aims to address the situation where the boundary between two classes is not linear.

We could consider enlarging the feature space using quadratic, cubic, or higher-order polynomial functions; however, the larger the number of features, the higher the computational burden.

Instead it is common to enlarge the feature space using an extension of a SVC termed a Support Vector Machine, which uses kernels.

  • A hyperplane does not need to be linear as the input feature space can be projected to higher dimensions using a kernel (e.g. radial basis kernel2,3), allowing a hyperplane to be fitted to split the data into classes. The data can then be mapped back into the original feature space to create a nonlinear separation boundary.

  1. Cover, T. M. (1965). Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE transactions on electronic computers, (3), 326-334.
  2. Varsavsky, A., Mareels, I., & Cook, M. (2016). Epileptic seizures and the EEG: measurement, models, detection and prediction. CRC Press.
Timing a fit using explicit polynomial feature engineering vs. the polynomial kernel (%timeit output):

Polynomial Feature Engineering: 1.46 ms ± 35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Polynomial Kernel: 1.1 ms ± 30 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
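The timings above compare fitting with explicit polynomial features against the polynomial kernel; the original code cell is not shown, but a comparison along these lines (dataset, degree, and hyperparameters are my assumptions) could look like:

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import SVC, LinearSVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=0)

# Explicit feature engineering: expand to degree-3 polynomial features first.
feature_clf = make_pipeline(PolynomialFeatures(degree=3), StandardScaler(),
                            LinearSVC(C=10, max_iter=10000))
feature_clf.fit(X, y)

# Kernel trick: the same expansion happens implicitly inside the kernel.
kernel_clf = make_pipeline(StandardScaler(),
                           SVC(kernel="poly", degree=3, coef0=1, C=10))
kernel_clf.fit(X, y)

print("feature-engineered training accuracy:", feature_clf.score(X, y))
print("polynomial-kernel training accuracy:", kernel_clf.score(X, y))
```

In a notebook, `%timeit feature_clf.fit(X, y)` and `%timeit kernel_clf.fit(X, y)` reproduce a timing comparison like the one above.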

The kernel trick relies on the fact that we can define our SVM in terms of dot products, $x_i\cdot x_j$.

Imagine we had a mapping $\Phi$ which maps the data to some high dimensional Euclidean space

$\Phi: \mathbb{R}^d \mapsto H$,

then we would still be using dot products in $H$. However we could instead use a kernel function

$K(x_i,x_j) = \Phi(x_i)\cdot \Phi(x_j)$,

meaning we never need to compute $\Phi$ explicitly, saving us computation.

We still have all the same considerations, but replacing $x_i\cdot x_j$ with $K(x_i, x_j)$ allows us to produce a SVM in infinite dimensional space.

In the test phase, instead of computing the sign of $f(x^*)$ using $w$ directly, we can use the support vectors, $s_i$:

$f(x) = \sum_{i = 1}^{N_S}\alpha_iy_i\Phi(s_i)\cdot \Phi(x) +b = \sum_{i = 1}^{N_S}\alpha_iy_iK(s_i,x) +b$,

avoiding computing $\Phi(x)$.


Burges, C. J. (1998). A tutorial on support vector machines for pattern recognition. Data mining and knowledge discovery, 2(2), 121-167.
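We can check this identity numerically: the decision function of a fitted scikit-learn SVC can be reconstructed from its stored support vectors, dual coefficients ($\alpha_i y_i$), and intercept alone (dataset and $\gamma$ are assumptions):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.2, random_state=0)
gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

def decision_from_svs(x):
    """f(x) = sum_i (alpha_i y_i) K(s_i, x) + b, over support vectors only."""
    sq_dists = np.sum((clf.support_vectors_ - x) ** 2, axis=1)
    K = np.exp(-gamma * sq_dists)  # RBF kernel K(s_i, x)
    return float(K @ clf.dual_coef_[0] + clf.intercept_[0])

x_star = np.array([0.5, 0.0])
print(decision_from_svs(x_star))
print(clf.decision_function([x_star])[0])  # same value
```

$\Phi$ never appears: only kernel evaluations against the support vectors are needed.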

Notes

  • $w$ lives in $H$ ("high dimensional")
  • $H$ is often referred to as a Hilbert space.
  • "it is clear that the above implicit mapping trick will work for any algorithm in which the data only appear as dot products (for example, the nearest neighbor algorithm). This fact has been used to derive a nonlinear version of principal component analysis by (Scholkopf, Smola and Muller, 1998b); it seems likely that this trick will continue to find uses elsewhere." Burges, C. J. (1998). A tutorial on support vector machines for pattern recognition. Data mining and knowledge discovery, 2(2), 121-167.

Another Description

The kernel approach is a computationally efficient way to enlarge our feature space to accommodate a non-linear boundary.

Skipping over some technical details (see REF), it turns out we can use the inner products of two observations rather than using the observations themselves:

$<x_i,x_{i^{\prime}}> = \sum^P_{j=1}x_{ij}x_{i^{\prime}j}$.

Using this we can represent the linear support vector classifier as:

$f(x) = \beta_0 + \sum^n_{i=1}\alpha_i<x,x_i>$.

In the above case, to estimate the parameters $\alpha_1,...,\alpha_n$ and $\beta_0$, we need the $n(n-1)/2$ inner products $<x_i,x_{i^{\prime}}>$ between all pairs of training observations. Similarly, to compute $f(x)$ we need the inner product between $x$ and each training point $x_i$.

However, $\alpha_i$ is nonzero only for the support vectors, so if we have the collection $S$ of indices of these support points we can do the following instead:

$f(x) = \beta_0 + \sum_{i\in S}\alpha_i<x,x_i>$.

Instead of actually calculating the inner product, we could use a generalisation, $K(x,x_{i^{\prime}})$, where $K$ is a kernel. We can now define the classifier as:

$f(x) = \beta_0 + \sum_{i\in S}\alpha_iK(x,x_i)$.

A kernel is a function that quantifies the similarity of two observations. For example, for a linear kernel we could use:

$K(x_i, x_{i'}) = \sum^P_{j=1}x_{ij}x_{i'j}$,

where we quantify the similarity of pairs of observations using Pearson (standard) correlation.

However, we could use other forms of kernel to fit the support vector classifier in a higher-dimensional space, such as a polynomial kernel:

$K(x_i, x_{i'}) = (1+\sum^P_{j=1}x_{ij}x_{i'j})^d$,

where $d$ is a positive integer.
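For $d = 2$ and two features, the kernel's implicit feature map can be written out explicitly, and we can verify numerically that $(1 + x\cdot z)^2 = \Phi(x)\cdot\Phi(z)$ (the helper names are mine):

```python
import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 polynomial kernel with p = 2:
    # (1 + x.z)^2 = phi(x) . phi(z)
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

def poly_kernel(x, z, d=2):
    return (1 + x @ z) ** d

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)

print(poly_kernel(x, z))   # kernel value, no expansion needed
print(phi(x) @ phi(z))     # identical, via the explicit 6-dimensional map
```

The kernel computes the same quantity in $O(p)$ time that the explicit map computes in the (much larger) expanded space.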

[MORE INFO ON polynomial kernels]

TODO

  • below has pretty much already been covered so take bits out and move into relevent section in the other notebook
  • add in new explanation of kernels from the published work I have been using in the other notebook

When an SVM trained using an RBF kernel classifies a test observation $x^* = (x^*_1, \ldots, x^*_p)^T$, only training observations close to $x^*$ (in terms of Euclidean distance) play a role in its class label. This is because, for training observations far from $x^*$, $\sum^P_{j=1}(x^*_j-x_{ij})^2$ will be large, so $\exp(-\gamma\sum^P_{j=1}(x^*_j-x_{ij})^2)$ will be tiny.

[MORE INFO ON RBF kernels]

[MORE INFO ON HOW C AND GAMMA INTERACT]

TODO

  • Include notes from "choosing c" on pg. 506 from ML a probabilistic perspective.
Phi(-1.0, -2) = [0.74081822]
Phi(-1.0, 1) = [0.30119421]
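The two values printed above are consistent with evaluating a 1-D Gaussian RBF similarity $\phi(x, \ell) = \exp(-\gamma(x - \ell)^2)$ with $\gamma = 0.3$ and landmarks $\ell = -2$ and $\ell = 1$ (my reconstruction; the original cell is not shown):

```python
import numpy as np

def phi(x, landmark, gamma=0.3):
    # Gaussian RBF similarity of a 1-D point to a landmark (gamma = 0.3 assumed).
    return np.exp(-gamma * (x - landmark) ** 2)

print("Phi(-1.0, -2) =", [round(phi(-1.0, -2), 8)])  # [0.74081822]
print("Phi(-1.0, 1) =", [round(phi(-1.0, 1), 8)])    # [0.30119421]
```

The point $x = -1$ is closer to the landmark at $-2$ than to the one at $1$, so its similarity to the first landmark is higher.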

Advantages¶

"It is this fact that allows us to construct hyperplanes in these very high dimensional spaces yet still be left with a tractable computation. Thus SVMs circumvent both forms of the “curse of dimensionality”: the proliferation of parameters causing intractable complexity, and the proliferation of parameters causing overfitting." Burges (1998)

Disadvantages¶

"Perhaps the biggest limitation of the support vector approach lies in choice of the kernel. Once the kernel is fixed, SVM classifiers have only one user-chosen parameter (the error penalty), but the kernel is a very big rug under which to sweep parameters. Some work has been done on limiting kernels using prior knowledge (Scholkopf et al., 1998a; Burges, 1998), but the best choice of kernel for a given problem is still a research issue." Burges (1998)

"A second limitation is speed and size, both in training and testing. While the speed problem in test phase is largely solved in (Burges, 1996), this still requires two training passes. Training for very large datasets (millions of support vectors) is an unsolved problem" Burges (1998)

"Discrete data presents another problem, although with suitable rescaling excellent results have nevertheless been obtained (Joachims, 1997)." Burges (1998)

Associated Exercises¶

Now might be a good time to try exercises X-X.

References¶

  1. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112, p. 18). New York: Springer.
  2. Géron, A. (2017). Hands-on machine learning with Scikit-Learn and TensorFlow: concepts, tools, and techniques to build intelligent systems. O'Reilly Media, Inc.
  3. Zheng, A., & Casari, A. (2018). Feature engineering for machine learning: principles and techniques for data scientists. O'Reilly Media, Inc.
  4. Raschka, 2016
  5. Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273-297.
  6. Burges, C. J. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2), 121-167.
  7. Murphy, K. P. (2012). Machine learning: a probabilistic perspective. MIT Press.
